During the second week of each unit, we’ll “walk through” a basic research workflow, or data analysis process, modeled after the Data-Intensive Research Workflow from Learning Analytics Goes to School (Krumm et al., 2018):
Each walkthrough will focus on a basic analysis guided by the social network perspective. This week, our focus will be on preparing relational data for analysis, putting together some basic network descriptives, and plotting a network visualization to help illustrate key findings.
Specifically, the Unit 1 Walkthrough will cover the following topics:
Prepare: Prior to analysis, we’ll take a look at the context from which our data came, formulate some research questions, and get introduced to the {igraph} R package for network analysis.
Wrangle: Wrangling data entails the work of manipulating, cleaning, transforming, and merging data. In section 2 we focus on importing network data, converting our familiar data frames into a network object that can be analyzed and graphed, and learning about “simple graphs.”
Explore: In section 3, we calculate some basic network descriptives and learn how to illustrate some of these stats through network visualization.
Model: While we won’t dig into approaches for modeling network data until Unit 3, we will take a quick look at some approaches used in the study guiding this walkthrough.
Communicate: We’ll learn more about communicating key findings next week, but for now you will be introduced to the basic components of a data product.
In Social Network Analysis and Education: Theory, Methods & Applications, Carolan (2013) notes that:
the social network perspective is one concerned with the structure of relations and the implication this structure has on individual or group behavior and attitudes
More specifically, Carolan cites the following four features used by Freeman (2004) to define the social network perspective:
Social network analysis is motivated by a relational intuition based on ties connecting social actors.
It is firmly grounded in systematic empirical data.
It makes use of graphic imagery to represent actors and their relations with one another.
It relies on mathematical and/or computational models to succinctly represent the complexity of social life.
For Unit 1, our walkthrough will be guided by previous research and evaluation work conducted by the Friday Institute for Educational Innovation as part of the Massively Open Online Courses for Educators (MOOC-Ed) initiative. The study introduced next and the hands-on analysis with R in this walkthrough will help to illustrate these four defining features of the social network perspective.
Take a quick look at the Description of the Dataset section from the Massively Open Online Course for Educators (MOOC-Ed) network dataset BJET article and the accompanying data sets stored on Harvard Dataverse that we’ll be using for this walkthrough.
In the space below, type a brief response to the following questions:
What were some of the steps necessary to construct this dataset?
What two “node attributes” from the dataset might be useful for predicting which participants may be more engaged or central to the network? Why did you select those two?
What else do you notice/wonder about this dataset?
A Social Network Perspective on Peer Supported Learning in MOOC-Eds was framed by three primary research questions related to peer supported learning:
What are the patterns of peer interaction and the structure of peer networks that emerge over the course of a MOOC-Ed?
To what extent do participant and network attributes (e.g., homophily, reciprocity, transitivity) account for the structure of these networks?
To what extent do these networks result in the co-construction of new knowledge?
For our very first walkthrough, we are going to focus exclusively on RQ1 from the original study and our question of interest about our discussion network is:
To what extent did educators engage with other participants in the discussion forums?
Based on what you know about networks and the context so far, what other research questions might we ask in this context that a social network perspective might be able to answer?
In the space below, type a brief response to the following questions:
-
We’ll revisit your response towards the end and provide an opportunity to refine your research question after you know the data a little better.
As highlighted in Chapter 6 of Data Science in Education Using R (DSIEUR):
Packages are shareable collections of R code that can contain functions, data, and/or documentation. Packages increase the functionality of R by providing access to additional functions to suit a variety of needs.
You can always check to see which packages have already been installed and loaded into RStudio Cloud by looking at the Files, Plots, & Packages Pane in the lower right-hand corner of RStudio as shown in the following screenshot:
You should see some familiar tidyverse packages from our Getting Started Walkthrough installed, like {dplyr} and {readr}, which we’ll be using again shortly. You should also see an important package called {igraph} that we will rely on heavily for our network analyses in this course.
If you are working in RStudio Desktop, or notice that the packages have not been installed and/or loaded, run the following install.packages() function code to install the {tidyverse} and {igraph} packages:
install.packages("tidyverse")
install.packages("igraph")
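If you would rather check from the console than from the Packages pane, here is a minimal sketch. The object names `needed`, `is_installed`, and `missing_pkgs` are our own illustrative choices and are not part of the original walkthrough.

```r
# Packages this walkthrough relies on
needed <- c("tidyverse", "igraph")

# installed.packages() returns a matrix whose row names are package names
is_installed <- needed %in% rownames(installed.packages())

# Names of any packages that still need to be installed
missing_pkgs <- needed[!is_installed]
missing_pkgs

# Uncomment to install only what is missing:
# install.packages(missing_pkgs)
```

If `missing_pkgs` comes back empty, both packages are already installed and you can skip straight to loading them.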
Let’s go ahead and use the library() function to load the {tidyverse} package and review which packages from the tidyverse collection it attaches.
Click the green arrow to run the following code and load our packages:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.1 ✓ dplyr 1.0.5
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
For our Unit 1 Walkthrough, we will rely heavily on the igraph network analysis package. The main goals of the igraph package and the collection of network analysis tools it contains are to provide a set of data types and functions for:
pain-free implementation of graph algorithms,
fast handling of large graphs, with millions of vertices (i.e., actors or nodes) and edges,
allowing rapid prototyping via high level languages like R.
Run the code chunk below to load the {igraph} library:
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
##
## compose, simplify
## The following object is masked from 'package:tidyr':
##
## crossing
## The following object is masked from 'package:tibble':
##
## as_data_frame
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
Take a look at the messages from the output after loading the igraph library. What tidyverse packages share identically named functions with igraph?
Write your response in the space below.
-
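When two loaded packages define a function with the same name, the one loaded most recently wins. One common way to be explicit is the `package::function()` form. The sketch below uses a made-up three-node graph (`toy_graph` is our own name) purely for illustration.

```r
library(igraph)

# A toy three-node ring graph, used only for illustration
toy_graph <- make_ring(3)

# The package::function() form calls igraph's as_data_frame() explicitly,
# regardless of whether another loaded package defines the same name
toy_edges <- igraph::as_data_frame(toy_graph, what = "edges")
toy_edges
```

The same pattern works in the other direction, so a masked function from another package can still be reached by prefixing it with that package's name.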
Congrats! You’re ready to start wrangling some data! Before proceeding further, knit your document and check to see if you encounter any errors.
In general, data wrangling involves some combination of cleaning, reshaping, transforming, and merging data (Wickham & Grolemund, 2017). The importance of data wrangling is difficult to overstate, as it involves the initial steps of going from the raw data to a dataset that can be explored and modeled (Krumm et al., 2018).
For our data wrangling this week, we’re keeping it simple since working with network data is a bit of a departure from our working with rectangular data frames. Our primary goals for Unit 1 are learning how to:
Import Data. An obvious and also important first step, we need to “read” our data into R and learn about formatting for edge-lists and node attribute files.
Create a Network Object. Before performing network analyses, we’ll need to convert our data frames into a special data format for working with relational data.
Simplify Network. Finally, we’ll learn about a handy simplify() function in the {igraph} package for collapsing multiple ties between actors and removing “self-loops.”
To get started, we need to import, or “read”, our data into R. The function used to import your data will depend on the file format of the data you are trying to import, but R is pretty adept at working with many file types.
Take a look in the /data folder in your Files pane. You should see the following .csv files:
dlt1-edgelist.csv
dlt1-nodes.csv
As its name implies, the first file dlt1-edgelist.csv is an edge-list that contains information about each tie, or relation between two actors in a network. In this context, a “tie” is a reply by one participant in the discussion forum to the post of another participant – or in some cases to their own post! Ties that connect an actor to themselves are called “self-loops” and, as we’ll see later in this section, igraph has a special function to remove these self-loops from a sociogram, or network visualization.
The edge-list format is slightly different than other formats you have likely worked with before in that the values in the first two columns of each row represent a dyad, or tie between two nodes in a network. An edge-list can also contain other information regarding the strength, duration, or frequency of the relationship, sometimes called “weight”, in addition to other “edge attributes.”
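To make the format concrete, here is a small sketch of an edge-list built by hand. The participants A, B, and C are hypothetical and are not drawn from the DLT 1 data.

```r
# A toy edge-list (hypothetical participants A, B, and C)
toy_ties <- data.frame(
  Sender   = c("A", "B", "B", "C"),
  Receiver = c("B", "A", "A", "C"),
  stringsAsFactors = FALSE
)

# Each row is one tie; a self-loop is a row where sender and receiver match
is_self_loop <- toy_ties$Sender == toy_ties$Receiver
toy_ties[is_self_loop, ]

# Repeated sender-receiver pairs indicate multiple ties within the same dyad
is_repeat <- duplicated(toy_ties[, c("Sender", "Receiver")])
toy_ties[is_repeat, ]
```

Here C replying to C is the only self-loop, and the second B-to-A row is the only repeated tie.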
In addition to our Sender and Receiver dyad pairs, our DLT 1 dataset contains the following edge attributes:
Sender = Unique identifier of author of comment
Receiver = Unique identifier of identified recipient of comment
Timestamp = Time post or reply was posted
Parent = Primary category or topic of thread
Category = Subcategory or subtopic of thread
Thread_id = Unique identifier of a thread
Comment_id = Unique identifier of a comment
Let’s use the read_csv() function from the {readr} package introduced in the Getting Started walkthrough to read in our edge-list and print the new ties data frame:
ties <- read_csv("data/dlt1-edgelist.csv",
                 col_types = cols(Sender = col_character(),
                                  Receiver = col_character(),
                                  `Category Text` = col_skip(),
                                  `Comment ID` = col_character(),
                                  `Discussion ID` = col_character()))
ties
## # A tibble: 2,529 x 9
## Sender Receiver Timestamp `Discussion Tit… `Discussion Cat… `Parent Categor…
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 360 444 4/4/13 16… Most important … Group N Units 1-3 Discu…
## 2 356 444 4/4/13 18… Most important … Group D-L Units 1-3 Discu…
## 3 356 444 4/4/13 18… DLT Resources—C… Group D-L Units 1-3 Discu…
## 4 344 444 4/4/13 18… Most important … Group O-T Units 1-3 Discu…
## 5 392 444 4/4/13 19… Most important … Group U-Z Units 1-3 Discu…
## 6 219 444 4/4/13 19… Most important … Group M Units 1-3 Discu…
## 7 318 444 4/4/13 19… Most important … Group M Units 1-3 Discu…
## 8 4 444 4/4/13 19… Most important … Group N Units 1-3 Discu…
## 9 355 356 4/4/13 20… DLT Resources—C… Group D-L Units 1-3 Discu…
## 10 355 444 4/4/13 20… Most important … Group D-L Units 1-3 Discu…
## # … with 2,519 more rows, and 3 more variables: Discussion Identifier <chr>,
## # Comment ID <chr>, Discussion ID <chr>
Note the addition of the col_types = argument for changing the column types to character strings since the numbers in those particular columns identify actors (Sender and Receiver) and attributes (Comment ID and Discussion ID). We also skipped the Category Text column since it was left blank for deidentification purposes.
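To see what col_types = changes, here is a minimal sketch that reads the same tiny file twice. The file contents and the `Blank` column name are made up for illustration, and `col_types = cols()` is used in the first call simply to let {readr} guess every column quietly.

```r
library(readr)

# Write a tiny CSV to a temporary file (made-up values, for illustration only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Sender,Receiver,Blank",
             "360,444,",
             "356,444,"), csv_path)

# Letting readr guess: ID columns made only of digits are parsed as numbers
guessed <- read_csv(csv_path, col_types = cols())
class(guessed$Sender)

# Declaring types: IDs stay character strings and the empty column is skipped
typed <- read_csv(csv_path,
                  col_types = cols(Sender = col_character(),
                                   Receiver = col_character(),
                                   Blank = col_skip()))
class(typed$Sender)
names(typed)
```

Keeping IDs as character strings makes it clear that they are labels for actors rather than quantities to do arithmetic on.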
RStudio Tip: Importing data and dealing with data types can be a bit tricky, especially for beginners. Fortunately, RStudio has an “Import Dataset” feature in the Environment Pane that can help you use the {readr} package and associated functions to greatly facilitate this process.
Consider the example pictured below of a discussion thread from the Planning for the Digital Learning Transition in K-12 Schools (DLT 1) course, where our data originated. This thread was initiated by participant I, so the comments by J and N are considered to be directed at I. The comment of B, however, is a direct response to the comment by N as signaled by the use of the quote-feature as well as the explicit mentioning of N’s name within B’s comment.
Now answer the following questions as they relate to the DLT 1 edge-list we just read into R.
Which actors in this thread are the Sender and the Receiver? Which actor is both?
How many dyads are in this thread? Which pairs of actors are dyads?
Sidebar: Unfortunately, these types of nuances in discussion forum data as illustrated by this simple example are rarely captured through automated approaches to constructing networks. Fortunately, the dataset you are working with was carefully reviewed to try and capture more accurately the intended recipients of each reply.
The second file we’ll be using contains all the nodes or actors (i.e., participants who posted to the discussion forum) as well as some of their attributes such as gender and years of experience in education.
Carolan (2013) notes that most social network analyses include variables that describe attributes of actors, ones that are either categorical (e.g., sex, race, etc.) or continuous in nature (e.g., test scores, number of times absent, etc.). These attributes can be incorporated into a network graph or model, making it more informative, and can aid in testing or generating hypotheses.
These attribute variables are typically included in a rectangular array, or data frame, that mimics the actor-by-attribute format that is the dominant convention in social science, i.e., rows represent cases, columns represent variables, and cells consist of values on those variables.
As an aside, Carolan also refers to this historical preference by researchers for “actor-by-attribute” data, used in the absence of relational data and with the actor removed from their social context, as the “sociological meatgrinder” in action. Specifically, this historical approach assumes that the actor does not interact with anyone else in the study and that outcomes are solely dependent on the characteristics of the individual.
Regardless, let’s read in our node attribute file and take a look at the actors and their attributes included in our dataset:
actors <- read_csv("data/dlt1-nodes.csv",
                   col_types = cols(UID = col_character(),
                                    Facilitator = col_character(),
                                    expert = col_character(),
                                    connect = col_character()))
Use the code chunk below and a function of your choosing to take a look at the actors data frame:
Match up the attributes included in the node file with the following codebook descriptors. The first one has been done as an example.
Facilitator = Identification of course facilitator (1 = instructor)
Before we can begin using many of the functions from the {igraph} package for summarizing and visualizing our DLT 1 network, we first need to convert the data frames that we imported into an igraph network object, or an igraph graph.
To do that, we will use the graph_from_data_frame() function. Note that I included the eval=FALSE argument in the code block below to prevent this code from running when we knit our final document. Otherwise it will produce an error since we can’t include help documentation in our knitted HTML file.
Run the following code to take a look at the help documentation for this function:
?graph_from_data_frame
You probably saw that this particular function takes the following three arguments, two of which are data frames:
d describes the edges of the network. The first two columns are the IDs of the source and the target node for each edge, in our case the Sender and Receiver of a discussion post – the order matters! The following columns are edge attributes such as weight, type, label, or anything else.
vertices starts with a column of node IDs and any following columns are interpreted as node attributes.
directed determines whether or not to create a directed graph.
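The three arguments above can be tried out on a toy example before we use our real data. The IDs, the `gender` values, and the object names below are all made up for illustration.

```r
library(igraph)

# Toy data (hypothetical IDs): the first two columns of the edge data
# are read as the source and target of each edge
toy_ties <- data.frame(Sender   = c("1", "2", "2"),
                       Receiver = c("2", "1", "3"),
                       stringsAsFactors = FALSE)
toy_actors <- data.frame(UID    = c("1", "2", "3", "4"),
                         gender = c("female", "male", "female", "male"),
                         stringsAsFactors = FALSE)

# Every ID used in the edge-list should appear in the first column of the
# vertex data frame, so it is worth checking before building the graph
all(c(toy_ties$Sender, toy_ties$Receiver) %in% toy_actors$UID)

toy_network <- graph_from_data_frame(d = toy_ties,
                                     vertices = toy_actors,
                                     directed = TRUE)

# Actor "4" has no ties but is still a vertex because it is listed in toy_actors
V(toy_network)$name
V(toy_network)$gender
```

Supplying the vertices data frame is what lets actors without any ties, and all of the node attributes, travel along with the network object.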
Run the following code to specify our ties data frame as the edges of our network, our actors data frame for the vertices of our network and their attributes, and indicate that this is indeed a directed network.
network <- graph_from_data_frame(d = ties,
                                 vertices = actors,
                                 directed = TRUE)
network
## IGRAPH f432851 DN-- 445 2529 --
## + attr: name (v/c), Facilitator (v/c), role1 (v/c), experience (v/n),
## | experience2 (v/c), grades (v/c), location (v/c), region (v/c),
## | country (v/c), group (v/c), gender (v/c), expert (v/c), connect
## | (v/c), Timestamp (e/c), Discussion Title (e/c), Discussion Category
## | (e/c), Parent Category (e/c), Discussion Identifier (e/c), Comment ID
## | (e/c), Discussion ID (e/c)
## + edges from f432851 (vertex names):
## [1] 360->444 356->444 356->444 344->444 392->444 219->444 318->444 4 ->444
## [9] 355->356 355->444 4 ->444 310->444 248->444 150->444 19 ->310 216->19
## [17] 19 ->444 19 ->4 217->310 385->444 217->444 393->444 217->19 256->219
## + ... omitted several edges
Carolan (2013) reminds us that one of the simplest and most often ignored structural properties of a social network is its size and explains that:
size is simply a measure of the number of nodes in the network.
He notes that the size of a network plays an important role in determining what happens in the network. For example, in a classroom of 30 students, it is not hard to imagine that the pattern of who communicates with whom will look much different than if the network consisted of hundreds or even thousands of students like in a MOOC.
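Besides reading size off the printed summary, {igraph} can report it directly. The sketch below uses a made-up four-actor network; `vcount()` and `ecount()` return the number of vertices and edges.

```r
library(igraph)

# A toy directed network with four actors, one of whom (D) has no ties
toy_network <- graph_from_data_frame(
  d = data.frame(from = c("A", "B", "C"),
                 to   = c("B", "C", "A"),
                 stringsAsFactors = FALSE),
  vertices = data.frame(name = c("A", "B", "C", "D"),
                        stringsAsFactors = FALSE),
  directed = TRUE
)

vcount(toy_network)       # size: the number of nodes
ecount(toy_network)       # the number of edges
is_directed(toy_network)  # whether ties have a direction
```

Note that size counts every actor in the network, including those who never sent or received a tie.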
Take a look at the very first line of the output which contains some basic information about our network and answer the following questions:
How many nodes and edges are in our network? Is this consistent with the number of observations in our data frames? Hint: Check the Environment pane.
The “D” and the “N” indicate that this is a Directed network and has the Name vertex attributes set. Why do the two spaces that follow these letters have dashes? Hint: check the help files.
Which vertex attribute did igraph interpret as numeric?
As you saw from the network output, our dataset has 2529 edges or ties, and just a quick scan of the edges in the network shows that edges like 356->444 occur more than once. So we know that participant 356 has replied to participant 444 at least twice.
Fortunately, the {igraph} package has a simplify() function for collapsing multiple edges so they are not represented more than once when we want to visually depict our network with a sociogram.
Let’s use that function to simplify our network and save it as simple_network, a simple graph that contains no self-loops or duplicate edges, both of which the simplify() function removes by default:
simple_network <- simplify(network, remove.loops = TRUE)
simple_network
## IGRAPH 6bcd645 DN-- 445 1936 --
## + attr: name (v/c), Facilitator (v/c), role1 (v/c), experience (v/n),
## | experience2 (v/c), grades (v/c), location (v/c), region (v/c),
## | country (v/c), group (v/c), gender (v/c), expert (v/c), connect (v/c)
## + edges from 6bcd645 (vertex names):
## [1] 1->2 1->7 1->22 1->30 1->36 1->41 1->49 1->50 1->68 1->88
## [11] 1->92 1->109 1->112 1->137 1->144 1->154 1->161 1->192 1->195 1->198
## [21] 1->221 1->444 1->445 2->36 2->67 2->104 2->177 2->223 3->2 3->7
## [31] 3->223 3->310 4->5 4->7 4->26 4->29 4->98 4->107 4->193 4->198
## [41] 4->207 4->308 4->444 5->8 5->12 5->21 5->24 5->67 5->107 5->444
## [51] 5->445 6->5 6->7 6->11 6->41 6->42 6->62 6->68 6->100 6->116
## + ... omitted several edges
Note that because simplify() removes self-loops by default, the remove.loops = TRUE argument does not really need to be included. If you wanted to keep self-loops, you would simply set this to FALSE.
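Here is a small sketch of what simplify() does, using a made-up network with one repeated tie and one self-loop. The actors and object names are hypothetical.

```r
library(igraph)

# Toy ties: A replies to B twice, B replies to A once, C replies to themselves
toy_ties <- data.frame(from = c("A", "A", "B", "C"),
                       to   = c("B", "B", "A", "C"),
                       stringsAsFactors = FALSE)
toy_network <- graph_from_data_frame(toy_ties, directed = TRUE)

# All four ties are present in the original network
n_original <- ecount(toy_network)

# Defaults: the duplicate A->B tie and the C->C self-loop are both removed
n_simple <- ecount(simplify(toy_network))

# Keep self-loops but still collapse the duplicate tie
n_with_loops <- ecount(simplify(toy_network, remove.loops = FALSE))

c(original = n_original, simple = n_simple, with_loops = n_with_loops)
```

Because the network is directed, A->B and B->A count as different edges and both survive simplification.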
Take a look at the output for our simple graph now and answer the following questions:
How many unique edges are in the network? Why do you think this is considerably less than our total edges?
Did we potentially lose any important or useful information by collapsing multiple edges into a single edge or by removing self-loops?
We noted earlier that edges can also contain attributes such as strength, duration or frequency, sometimes called “weight.” These weights can not only help us better understand the relationship between two actors, but also aid in visualization and modeling later on.
When we used the simplify() function earlier, it collapsed our duplicate edges but we lost some vital information as a result, namely the frequency of replies among pairs of educators in our discussion forum.
Fortunately, the simplify() function contains an argument that will allow us to count the number of ties between two actors, similar to how we might use the count() function in the {dplyr} package like so:
edge_weights <- count(ties, Sender, Receiver)
edge_weights
## # A tibble: 1,978 x 3
## Sender Receiver n
## <chr> <chr> <int>
## 1 1 109 1
## 2 1 112 1
## 3 1 137 1
## 4 1 144 2
## 5 1 154 1
## 6 1 161 1
## 7 1 192 2
## 8 1 195 1
## 9 1 198 1
## 10 1 2 1
## # … with 1,968 more rows
In this case, we see that participant 1 replied to participant 144 twice throughout the course.
To add weights to our simplified network, we first need to add a weight variable to the edges in our original network igraph object.
The {igraph} package has a unique syntax for working with attributes of network objects. To add a weight attribute to the E() edges in our network we’ll use the $ operator which can be used to create a new weight variable – or select a variable as we’ll see later on – and we’ll use the <- assignment operator to add an initial value of 1 for the weight of each edge.
Let’s put that all together and run the code to add a weight of 1 to each edge in our network:
E(network)$weight <- 1
Now let’s take a look at our igraph network object again:
network
## IGRAPH f432851 DNW- 445 2529 --
## + attr: name (v/c), Facilitator (v/c), role1 (v/c), experience (v/n),
## | experience2 (v/c), grades (v/c), location (v/c), region (v/c),
## | country (v/c), group (v/c), gender (v/c), expert (v/c), connect
## | (v/c), Timestamp (e/c), Discussion Title (e/c), Discussion Category
## | (e/c), Parent Category (e/c), Discussion Identifier (e/c), Comment ID
## | (e/c), Discussion ID (e/c), weight (e/n)
## + edges from f432851 (vertex names):
## [1] 360->444 356->444 356->444 344->444 392->444 219->444 318->444 4 ->444
## [9] 355->356 355->444 4 ->444 310->444 248->444 150->444 19 ->310 216->19
## [17] 19 ->444 19 ->4 217->310 385->444 217->444 393->444 217->19 256->219
## + ... omitted several edges
We can see that our network is now weighted as indicated by the “W” and that our new weight attribute has been added.
We can now use the edge.attr.comb = argument to “sum” the weights for each occurrence of a pair of actors, so if participant 1 replied to participant 144 five times over the course of the MOOC-Ed, there would be a weight of 5 for that pair.
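Before applying this to our real network, here is a minimal sketch of the idea on a made-up network. The actors and object names are hypothetical; the calls mirror the ones used in this section.

```r
library(igraph)

# Toy ties: A replies to B twice, B replies to A once, C replies to themselves
toy_ties <- data.frame(from = c("A", "A", "B", "C"),
                       to   = c("B", "B", "A", "C"),
                       stringsAsFactors = FALSE)
toy_network <- graph_from_data_frame(toy_ties, directed = TRUE)

# Give every reply an initial weight of 1
E(toy_network)$weight <- 1

# Collapse duplicate ties, summing their weights; self-loops are dropped by default
toy_weighted <- simplify(toy_network,
                         edge.attr.comb = list(weight = "sum"))

# One row per remaining edge, with its summed weight
toy_edges <- igraph::as_data_frame(toy_weighted, what = "edges")
toy_edges
```

The two A-to-B replies collapse into a single edge whose weight records that there were two of them, so the frequency information is no longer lost.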
Run the code to simplify our weighted network:
weighted_network <- simplify(network,
                             edge.attr.comb = list(weight = "sum"))
Let’s take a look at the output and ignore the error message for now:
weighted_network
Take a look at the output for our weighted graph now and answer the following question:
How do the numbers of total edges and unique edges compare to the totals reported for the DLT 2 course in our guiding study?
Congrats! You made it to the end of the data wrangling section and are ready to start analysis! Before proceeding further, knit your document and check to see if you encounter any errors.
As noted in the Getting Started Walkthrough, exploratory data analysis involves the processes of describing your data (such as by calculating the means and standard deviations of numeric variables, or counting the frequency of categorical variables) and, often, visualizing your data prior to modeling.
In Section 3, we will learn some new functions for retrieving basic network descriptives related to our research question and create a network visualization to help illustrate key findings. Specifically, in this section we’ll learn to:
Examine Basic Descriptives. We focus primarily on actors and edges in this walkthrough, including the edge weights we added in the previous section as well as node degree, an important and fairly intuitive measure of centrality.
Make a Sociogram. Finally, we wrap up the Explore phase by learning to plot a network and tweak key elements like the size, shape, and position of nodes and edges to better communicate key findings.
Many analyses of social networks are primarily descriptive. As Carolan (2013) notes, these descriptive studies aim either to represent the network’s underlying social structure through data-reduction techniques or to characterize network properties through network measures.
A key structural property of networks is the concept of centralization. A network that is highly centralized is one in which relations are focused on a small number of actors or even a single actor in a network, whereas ties in a decentralized network are diffuse and spread over a number of actors. One of the most common descriptives reported in network studies and a primary measure of centralization is degree.
Degree is the number of ties to and from an ego. In a directed network, in-degree is the number of ties received, whereas out-degree is the number of ties sent.
The {igraph} package has an aptly named function degree() for retrieving degree, in-degree, and out-degree for all actors in a network.
Run the following code to extract measures and save to node_degree which we’ll examine in just a bit:
node_degree <- degree(weighted_network, mode = "all")
Note. We set the mode = argument in this function to “all”, which gives us, for each actor in our network, the number of participants they replied to plus the number of participants who replied to them.
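To see how the three modes differ, here is a small sketch on a made-up three-actor network. The actors and object names are hypothetical.

```r
library(igraph)

# Toy ties: A and B reply to each other, and C replies to A
toy_ties <- data.frame(from = c("A", "B", "C"),
                       to   = c("B", "A", "A"),
                       stringsAsFactors = FALSE)
toy_network <- graph_from_data_frame(toy_ties, directed = TRUE)

toy_in  <- degree(toy_network, mode = "in")   # ties received
toy_out <- degree(toy_network, mode = "out")  # ties sent
toy_all <- degree(toy_network, mode = "all")  # ties sent plus ties received

rbind(toy_in, toy_out, toy_all)
```

Actor A receives two ties and sends one, for a total degree of three. Because A and B reply to each other, that one pair contributes two ties to A's total, so total degree can be larger than the number of distinct people an actor is connected to.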
Let’s take a look at the distribution of node_degree in our network by using R’s built in hist() function for creating histograms. I set the value of breaks =, or bins in our histogram, to 30 since I already know some actors in this network have a very high degree.
hist(node_degree, breaks = 30)
We can see that most actors in the network are connected to very few individuals, while a couple of actors in this network are connected to a very large number, nearly 200 and 350 respectively!
Now let’s take a look at the mean and median for node_degree using some other {base} R functions:
mean(node_degree)
## [1] 8.701124
median(node_degree)
## [1] 4
We see that the mean suggests the participants are, on average, connected to nearly 9 other participants in the MOOC-Ed, but this is likely heavily skewed by the two individuals with a disproportionate number of connections. The median is probably a better characterization of the typical number of educators with whom a participant has exchanged a reply.
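A quick sketch with made-up numbers shows why a single highly connected actor pulls the mean so far from the median. The values in `toy_degree` are invented for illustration and are not taken from our network.

```r
# Five typical actors and one extremely well-connected actor (made-up values)
toy_degree <- c(1, 2, 2, 3, 4, 200)

toy_mean   <- mean(toy_degree)    # pulled upward by the single large value
toy_median <- median(toy_degree)  # barely affected by it

# Dropping the extreme value brings the mean back in line with the median
typical_mean <- mean(toy_degree[toy_degree < 200])

c(mean = toy_mean, median = toy_median, mean_without_extreme = typical_mean)
```

This is the same pattern we see in node_degree, just on a much smaller scale.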
Let’s go ahead and take a look at in-degree next:
in_degree <- degree(weighted_network, mode="in")
hist(in_degree, breaks = 30)
mean(in_degree)
## [1] 4.350562
median(in_degree)
## [1] 1
Again, most participants received a reply from a small number of individuals.
Use the code chunk below to examine the distribution, mean and median of out_degree:
In the space below, write your interpretation of these results.
-
Finally, let’s also take a look at the distribution, mean and median of the edge weights we added to our graph.
Recall from earlier that the {igraph} package has a unique syntax for accessing node and edge attributes. For edges we use E() and include the name of the network object we want to use, followed by the $ operator to select the attribute.
weights <- E(weighted_network)$weight
hist(weights, breaks = 10)
mean(weights)
## [1] 1.26188
median(weights)
## [1] 1
It looks like the vast majority of sender-receiver pairs are connected by just a single reply.
The use of the $ is actually standard across R and a very useful operator.
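Here is a minimal sketch showing the $ operator in both settings, using a made-up data frame and network. The names `toy_df` and `toy_network` are our own.

```r
library(igraph)

# With a data frame, $ selects a column by name
toy_df <- data.frame(from = c("A", "B"),
                     to   = c("B", "C"),
                     n    = c(2, 1),
                     stringsAsFactors = FALSE)
toy_df$n

# Columns after the first two become edge attributes of the network
toy_network <- graph_from_data_frame(toy_df, directed = TRUE)

E(toy_network)$n     # $ selects an edge attribute
V(toy_network)$name  # and V() works the same way for vertex attributes

# Combined with <-, the same syntax creates or overwrites an attribute
E(toy_network)$n <- E(toy_network)$n * 10
E(toy_network)$n
```

In each case the result is an ordinary R vector, so it can be passed straight to functions like hist(), mean(), and median().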
Use the code chunk below to create a histogram and calculate the mean and median from the edge weights created earlier in the Add Edge Weights section by using the edge_weights data frame, the $ operator, and column n which contains the counts for each unique edge.
Are these results consistent with our summary of edge weights we created above?
-
If you recall from our 1a. Review the Research section, one of the defining characteristics of the social network perspective is its use of graphic imagery to represent actors and their relations with one another. To emphasize this point, Carolan (2013) reported that:
The visualization of social networks has been a core practice since its foundation more than 90 years ago and remains a hallmark of contemporary social network analysis.
Network visualization can be used for a variety of purposes, ranging from highlighting key actors to even serving as works of art. This excellent figure from Katya Ognyanova’s also excellent tutorial on Static and Dynamic Network Visualization with R helps illustrate the variety of goals a good network visualization can accomplish:
These visual representations of the actors and their relations, i.e., the network, are called sociograms. Actors who are most central to the network, such as those with higher node degrees, are usually placed in the center of the sociogram and their ties are placed near them. As we’ll see in just a bit, those two actors with hundreds of ties will be placed by most graph layout algorithms in the center of the graph.
For Unit 1, we’ll be using R’s generic plot() function, which {igraph} extends with its own method for network objects, to make a sociogram.
Let’s run the code without any tweaking and see what the plot function produces:
plot(weighted_network)
If this had been a smaller network like one generated from a teacher professional development workshop, this might have been useful, but for large networks like our MOOC-Ed discussion forums, this doesn’t communicate much. In fact, it’s visualizations like these that give sociograms the unflattering nickname of “hair ball” plots.
Fortunately, the {igraph} package includes a plethora of plotting parameters for improving the layout and visual design of network graphs!
There are many ways to modify vertices and edges in a sociogram to improve readability. Up to four attributes of each actor can be layered onto the sociogram by altering the color, shape, label, and size of the symbol. And attributes for edges can be represented in a range of ways as well.
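As a rough sketch of layering attributes onto a sociogram, the example below maps a made-up `role` attribute to node color and node degree to node size. The network, the role values, and the color names are all illustrative choices rather than anything from the DLT 1 data.

```r
library(igraph)

# A toy network of four actors with a hypothetical "role" attribute
toy_network <- graph_from_data_frame(
  d = data.frame(from = c("B", "C", "D", "D"),
                 to   = c("A", "A", "A", "B"),
                 stringsAsFactors = FALSE),
  vertices = data.frame(name = c("A", "B", "C", "D"),
                        role = c("facilitator", "teacher", "teacher", "teacher"),
                        stringsAsFactors = FALSE),
  directed = TRUE
)

# Color encodes a categorical attribute: one color per role
node_color <- ifelse(V(toy_network)$role == "facilitator",
                     "tomato", "steelblue")

# Size encodes a numeric attribute: a base size plus a bump for each tie
node_size <- 10 + 5 * degree(toy_network, mode = "all")

plot(toy_network,
     vertex.color = node_color,
     vertex.size = node_size,
     vertex.label = V(toy_network)$name,
     edge.arrow.size = 0.5)
```

Each plotting argument accepts either a single value applied to every node or a vector with one value per node, which is what makes this kind of mapping possible.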
One quick fix is to simply remove the labels by adding the vertex.label = argument and setting it to NA.
Since the number labels don’t provide much useful information, let’s go ahead and remove them:
plot(weighted_network,
     vertex.label = NA)
We can now at least see the nodes, or vertices, but with 445 participants in the forums, many are masked by the default size of each node.
By default, the size of nodes is 15. Let’s add the vertex.size = argument and change that to 1:
plot(weighted_network,
     vertex.label = NA,
     vertex.size = 1)
A little better, but let’s use the node_degree we calculated to emphasize those with more connections in the course by substituting node_degree for the fixed value:
plot(weighted_network,
     vertex.label = NA,
     vertex.size = node_degree)
Whoops! This certainly emphasizes the disproportionate number of ties among these two participants, but not quite what we’re going for.
Since node_degree is a numeric vector with one value per node, we can do arithmetic on it directly. Let’s divide each value by 10 to reduce the relative size:
plot(weighted_network,
     vertex.label = NA,
     vertex.size = node_degree/10)
That’s better, and helps highlight our two main actors without overshadowing the rest, but those arrows are definitely an eyesore.
Let’s add the edge.arrow.size = argument and dramatically reduce the size of the arrows:
plot(weighted_network,
     vertex.label = NA,
     vertex.size = node_degree*.1,
     edge.arrow.size = .04)
Much better! Now let’s deal with the width of the edges.
We can add the edge.width = argument to help minimize edge overlap yet still make them visible. This is a bit of trial and error but let’s give .2 a try since the default is 1:
plot(weighted_network,
vertex.label = NA,
vertex.size = node_degree*.05,
edge.arrow.size = .04,
edge.width = .2)
Note that we could also set edge.width = to the weight of each edge, similar to how we used degree for the size of each node.
Let’s give that a try:
plot(weighted_network,
vertex.label = NA,
vertex.size = node_degree*.1,
edge.arrow.size = .04,
edge.width = E(weighted_network)$weight)
Not much of an improvement, if at all, given how many edges are in our network.
Let’s try dividing the weight by 5 to reduce the edge width, like we did with the vertex.size = argument:
plot(weighted_network,
vertex.label = NA,
vertex.size = node_degree*.05,
edge.arrow.size = .05,
edge.width = E(weighted_network)$weight/5)
Better, but since we know most edges have a weight of 1 there really isn’t too much to work with here.
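If you want to check for yourself how edge weights are distributed before mapping them to edge width, a sketch along these lines may help. It assumes a tiny made-up edge list (toy_edges and its weight values are hypothetical, not our MOOC-Ed data), and the log transform is just one option for keeping a few heavy ties from dominating the plot.

```r
library(igraph)

# Hypothetical weighted edge list: most ties have weight 1, one is much heavier
toy_edges <- data.frame(
  from   = c(1, 1, 2, 3, 4, 5),
  to     = c(2, 3, 3, 4, 1, 1),
  weight = c(1, 1, 1, 1, 2, 6)
)

# Extra columns in the edge list (here, weight) become edge attributes
toy_network <- graph_from_data_frame(toy_edges, directed = TRUE)

# Pull out the weights and count how often each value occurs
edge_weights <- E(toy_network)$weight
table(edge_weights)

# A log transform compresses the heavy tie while keeping weight-1 ties at width 1
scaled_widths <- 1 + log(edge_weights)

plot(toy_network,
     vertex.label = NA,
     edge.arrow.size = .3,
     edge.width = scaled_widths)
```

Running table() on the weights of our own network is a quick way to confirm that most edges have a weight of 1.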
Let’s move on to changing graph layout!
One of the major advances in visualization since the first hand-drawn sociograms developed by Jacob Moreno (1934) to represent relations among children in school is the use of software and algorithms to automatically lay out networks on a grid.
There are many different layout methods, but we’ll start with the Fruchterman-Reingold (FR) layout, which is one of the most used layout algorithms for network visualization. These types of force-directed algorithms generally work well with large networks and try to lay out graphs in “an aesthetically-pleasing way” by making edges roughly equal in length and minimizing overlap.
Let’s give the FR layout a try by using the layout = argument and specifying layout_with_fr:
plot(weighted_network,
vertex.label = NA,
vertex.size = node_degree*.05,
edge.arrow.size = .05,
edge.width = E(weighted_network)$weight/5,
layout = layout_with_fr)
By default, {igraph} uses layout_nicely, a smart function that chooses a layout based on the graph. Given the similarity with our previous graph, I’m assuming it chose the FR layout as well.
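One thing worth knowing is that force-directed layouts start from random positions, so the same code can produce a slightly different picture each time. A minimal sketch of one way to handle this is below. It assumes a random toy graph instead of weighted_network, and the seed value and the names toy_network and fr_coords are hypothetical.

```r
library(igraph)

# Fix the random seed so the toy graph and its layout are reproducible
set.seed(123)

# Hypothetical stand-in for weighted_network: a random directed graph with 30 nodes
toy_network <- sample_gnp(30, 0.1, directed = TRUE)

# Calling the layout function ourselves returns a matrix of x/y coordinates,
# one row per vertex
fr_coords <- layout_with_fr(toy_network)
dim(fr_coords)

# Passing the same coordinate matrix to layout = keeps every node in place,
# so only the styling changes between these two plots
plot(toy_network,
     vertex.label = NA,
     vertex.size = 5,
     edge.arrow.size = .2,
     layout = fr_coords)

plot(toy_network,
     vertex.label = NA,
     vertex.size = 10,
     edge.arrow.size = .2,
     layout = fr_coords)
```

Saving the coordinates this way can make it easier to compare styling choices, since the node positions no longer shift between plots.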
Let’s try one more just for fun:
plot(weighted_network,
vertex.label = NA,
vertex.size = node_degree*.05,
edge.arrow.size = .05,
edge.width = E(weighted_network)$weight/5,
layout = layout_in_circle)
Interesting but not terribly useful!
There are many other igraph layout functions, some more useful than others depending on the context. We’ve only scratched the surface of what’s possible with network plotting, but hopefully this has given you a sense of what’s required to avoid the dreaded “hair ball” plot and create a network viz that communicates something useful.
Try modifying the code below by tweaking the included arguments/parameters or adding new arguments/parameters to make our plot either more “aesthetically pleasing” or more purposeful in what it’s trying to communicate.
You’re also welcome to try out some of the other graphing options from Katya Ognyanova’s tutorial on Static and Dynamic Network Visualization with R.
There are no right or wrong answers, just have some fun trying out different approaches!
plot(weighted_network,
vertex.label = NA,
vertex.size = node_degree*.05,
edge.arrow.size = .05,
edge.width = E(weighted_network)$weight/5,
layout = layout_with_fr)
Congrats! You made it to the end of the Explore section and are ready to learn a little about network modeling! Before proceeding further, knit your document and check to see if you encounter any errors.
As highlighted in Chapter 3 of Data Science in Education Using R, the Model step of the data science process entails “using statistical models, from simple to complex, to understand trends and patterns in the data.” The authors note that while descriptive statistics and data visualization during the Explore step can help us to identify patterns and relationships in our data, statistical models can be used to help us determine if relationships, patterns and trends are actually meaningful.
We will not explore the use of models for SNA until Unit 3, but recall from A Social Network Perspective in MOOC-Eds that the study was guided by the following questions:
What are the patterns of peer interaction and the structure of peer networks that emerge over the course of a MOOC-Ed?
To what extent do participant and network attributes (e.g., homophily, reciprocity, transitivity) account for the structure of these networks?
To what extent do these networks result in the co-construction of new knowledge?
To address Question 1, actors in the network were categorized into distinct mutually exclusive groups using the core-periphery and regular equivalence functions of UCINET. The former used the CORR algorithm to separate actors that are part of a densely connected subgroup, or “core,” from those that are part of the sparsely connected periphery. Regular equivalence employs the REGE blockmodeling algorithm to partition, or group, actors in the network based on the similarity of their ties to others with similar ties. In essence, blockmodeling provides a systematic way for categorizing educators based on the ways in which they interacted with peers.
As we saw upon just a basic visual inspection of our network during the Explore section, there was a small core of highly connected participants surrounded by those on the “periphery,” or edge, of the network with very few connections. In the DLT 2 course, those on the periphery made up roughly 90% of the network! The study also found relatively high levels of reciprocation, but also found that roughly a quarter of participants were characterized as “broadcasters”: educators who initiated a discussion thread, but neither reciprocated with those who replied, nor posted to threads initiated by others.
To address Question 2, this study used the exponential family of random graph models (ERGM; also known as p* models), which provide a statistical approach to network modeling that addresses the complex dependencies within networks. ERGMs predict network ties and determine the statistical likelihood of a given network structure, based on an assumed dependency structure, the attributes of the individuals (e.g., gender, popularity, location, previous ties) and prior states of the network.
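The study ran these analyses in UCINET, but if you’d like a rough feel for core, periphery, and reciprocation using {igraph}, the sketch below may help. Please note the assumptions: it uses a made-up toy network, and it swaps in k-core decomposition via coreness() as a loose stand-in for the idea of a core. This is a different technique from the CORR algorithm used in the study and will not reproduce its results.

```r
library(igraph)

# Hypothetical toy network: a, b, and c all reply to each other (a "core"),
# while d, e, and f each send a single tie that is never returned
toy_edges <- data.frame(
  from = c("a", "b", "a", "c", "b", "c", "d", "e", "f"),
  to   = c("b", "a", "c", "a", "c", "b", "a", "b", "c")
)
toy_network <- graph_from_data_frame(toy_edges, directed = TRUE)

# Proportion of directed ties that are reciprocated (6 of the 9 ties here)
toy_reciprocity <- reciprocity(toy_network)
toy_reciprocity

# k-core decomposition, ignoring tie direction; higher values indicate
# membership in a more densely connected subgroup
k_core <- setNames(coreness(toy_network, mode = "all"),
                   V(toy_network)$name)
k_core

# Color the most deeply embedded actors differently from everyone else
core_colors <- ifelse(k_core == max(k_core), "tomato", "grey80")

plot(toy_network,
     vertex.color = core_colors,
     vertex.size = 20,
     edge.arrow.size = .4)
```

In this toy example the three actors who reply to one another end up with higher coreness values than the three who only send a single tie.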
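We won’t fit these models until Unit 3, but for the curious, here is a minimal sketch of what the simplest possible ERGM looks like in R. It assumes you have the {ergm} package installed and uses the flomarriage example network that ships with that package, not our MOOC-Ed data and not the model specification from the study.

```r
library(ergm)  # also attaches the {network} package

# Example data bundled with {ergm}: marriage ties among Florentine families
data(florentine)

# The simplest ERGM has a single "edges" term, which models the overall
# tendency for ties to form; it plays a role similar to an intercept
fit <- ergm(flomarriage ~ edges)
summary(fit)

# The coefficient is on the log-odds scale, so converting it back to a
# probability recovers the density of the network
tie_probability <- plogis(coef(fit)[["edges"]])
tie_probability
network.density(flomarriage)

# For a directed network, a term such as mutual could be added to the
# formula to model reciprocity; that is left for Unit 3
```

A model with only an edges term is not very interesting on its own, but it shows the basic pattern: a network on the left of the formula and the structural terms or attributes we think explain its ties on the right.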
Recall from 1a. Review the Research that you were asked to identify two “node attributes” from the dataset that might be useful for predicting participants who may be more engaged or central to the network.
Take a look at page 276 of A social network perspective on peer supported learning in MOOCs for educators. Were your predictions correct?
-
The final step in our workflow/process is sharing the results of analysis with a wider audience. Krumm et al. (2018) have outlined the following 3-step process for communicating with education stakeholders what you have learned through analysis:
Select. Communicating what one has learned involves selecting among those analyses that are most important and most useful to an intended audience, as well as selecting a form for displaying that information, such as a graph or table in static or interactive form, i.e. a “data product.”
Polish. After creating initial versions of data products, research teams often spend time refining or polishing them, by adding or editing titles, labels, and notations and by working with colors and shapes to highlight key points.
Narrate. Writing a narrative to accompany the data products involves, at a minimum, pairing a data product with its related research question, describing how best to interpret the data product, and explaining the ways in which the data product helps answer the research question.
Next week we’ll take a look at refining our analysis and ways we might communicate and share findings with education stakeholders.
Now that you’ve become more familiar with this dataset and the social network perspective, what other aspects of this dataset, or a dataset you are interested in exploring, could you investigate?
-
What specific research questions might you ask that would be helpful for better understanding and improving learning, or the context in which the data is collected?
-
Congrats! You’ve finished the Unit 1 Guided Walkthrough and are ready for some independent analysis next week!
To complete this assignment, knit your document and send me an email at sbkellog@ncsu.edu letting me know you’re all set.